Abstract

We propose and analyze a new vantage point for the learning of mixtures of Gaussians:
namely, the PAC-style model of learning probability distributions introduced by Kearns et al. [13].
Here the task is to construct a hypothesis mixture of Gaussians that is statistically
indistinguishable from the actual mixture generating the data; specifically, the KL divergence
should be at most $\epsilon$.

In this scenario, we give a poly$(n/\epsilon)$ time algorithm that learns the class of mixtures of
any constant number of axis-aligned Gaussians in $\mathbb{R}^n$. Our algorithm makes no assumptions
about the separation between the means of the Gaussians, nor does it have any dependence on
the minimum mixing weight. This is in contrast to learning results known in the "clustering"
model, where such assumptions are unavoidable.



Our algorithm relies on the method of moments, and a subalgorithm developed in [8] for
a discrete mixture-learning problem.

1 Introduction

In [13], Kearns et al. introduced an elegant and natural model of learning unknown probability
distributions. In this framework we are given a class $\mathcal{C}$ of probability distributions over
$\mathbb{R}^n$ and access to random data sampled from an unknown distribution $Z$ that belongs to
$\mathcal{C}$. The goal is to output a hypothesis distribution $Z'$ which with high confidence is
$\epsilon$-close to $Z$ as measured by the Kullback-Leibler (KL) divergence, a standard measure of the
distance between probability distributions (see Section 2 for details on this distance measure).
The learning algorithm should run in time poly$(n/\epsilon)$. This model is well motivated by its close
analogy to Valiant's classical Probably Approximately Correct (PAC) framework for learning Boolean
functions [18].

*Some of this work was done while supported by an NSF Mathematical Sciences Postdoctoral Research Fellowship 
at Columbia University. 

^Some of this work was done while at the Institute for Advanced Study. 

* Supported in part by NSF award CCF-0347282, by NSF award CCF-0523664, and by a Sloan Foundation Fel- 
lowship. 






Several notable results, both positive and negative, have been obtained for learning in the
Kearns et al. framework of [13]; see, e.g., [10, 15]. Here we briefly survey some of the positive
results that have been obtained for learning various types of mixture distributions. (Recall that
given distributions $X^1,\dots,X^k$ and mixing weights $\pi^1,\dots,\pi^k$ that sum to 1, a draw from
the corresponding mixture distribution is obtained by first selecting $i$ with probability $\pi^i$
and then making a draw from $X^i$.) Kearns et al. gave an efficient algorithm for learning certain
mixtures of Hamming balls; these are product distributions over $\{0,1\}^n$ in which each coordinate
mean is either $p$ or $1-p$ for some $p$ fixed over all mixture components. Subsequently Freund and
Mansour and, independently, Cryan et al. [4] gave efficient algorithms for learning a mixture of two
arbitrary product distributions over $\{0,1\}^n$. Recently, Feldman et al. [8] gave a poly$(n)$-time
algorithm that learns a mixture of any $k = O(1)$ many arbitrary product distributions over the
discrete domain $\{0,1,\dots,b-1\}^n$ for any $b = O(1)$.
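The two-stage sampling process recalled above (select a component by its mixing weight, then draw from it) can be sketched in a few lines of Python. This is only an illustration: the function names, the two-component example, and all numeric parameters are hypothetical and do not come from the paper.

```python
import random

def sample_mixture(weights, samplers, rng):
    """Draw once from a mixture: pick index i with probability weights[i],
    then draw from the i-th component. Returns (i, draw)."""
    i = rng.choices(range(len(weights)), weights=weights, k=1)[0]
    return i, samplers[i](rng)

def gaussian_product(means, stds):
    """Sampler for an axis-aligned Gaussian: independent coordinates."""
    return lambda rng: [rng.gauss(m, s) for m, s in zip(means, stds)]

# Hypothetical mixture of two axis-aligned Gaussians over R^2.
rng = random.Random(0)
weights = [0.3, 0.7]
samplers = [
    gaussian_product([0.0, 0.0], [1.0, 2.0]),
    gaussian_product([5.0, -1.0], [0.5, 0.5]),
]
draws = [sample_mixture(weights, samplers, rng) for _ in range(10000)]

# Fraction of draws that came from the first component (should be near 0.3).
frac_first = sum(1 for i, _ in draws if i == 0) / len(draws)
```

A learner in the model considered here sees only the drawn points, never the component index `i`; the index is returned above purely so the sketch can be checked.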

1.1 Results 

As described above, research on learning mixture distributions in the PAC-style model of Kearns
et al. has focused on distributions over discrete domains. In this paper we consider the natural
problem of learning mixtures of Gaussians in the PAC-style framework of [13]. Our main result is
the following theorem:

Theorem 1 (Informal version) Fix any $k = O(1)$, and let $Z$ be any unknown mixture of $k$
axis-aligned Gaussians over $\mathbb{R}^n$. There is an algorithm that, given samples from $Z$ and any
$\epsilon, \delta > 0$ as inputs, runs in time poly$(n/\epsilon) \cdot \log(1/\delta)$ and with probability
$1 - \delta$ outputs a mixture $Z'$ of $k$ axis-aligned Gaussians over $\mathbb{R}^n$ satisfying
$\mathrm{KL}(Z \| Z') \le \epsilon$.

A signal feature of this result is that it requires no assumptions about the Gaussians being 
"separated" in space. It also has no dependence on the minimum mixing weight. We compare our 
result with other works on learning mixtures of Gaussians in the next section. 

Our proof of Theorem 1 works by extending the basic approach for learning mixtures of product
distributions over discrete domains from [8]. The main technical tool introduced in [8] is the WAM
(Weights And Means) algorithm; the correctness proof of WAM is based on an intricate error
analysis using ideas from the singular value theory of matrices. In this paper, we use this algorithm
in a continuous domain to estimate the parameters of the Gaussian mixture. Dealing with this
more complex class of distributions requires tackling a whole new set of issues around sampling
error that did not exist in the discrete case.

Our results strongly suggest that the techniques introduced in [8] (and extended here) extend
to PAC learning mixtures of other classes of product distributions, both discrete and continuous,
such as exponential distributions or Poisson distributions. Though we have not explicitly worked
out those extensions in this paper, we briefly discuss general conditions under which our techniques
are applicable in Section 7.

1.2 Comparison with other frameworks for learning mixtures of Gaussians 

There is a vast literature in statistics on modeling with mixture distributions, and on estimating the 
parameters of unknown such distributions from data. The case of mixtures of Gaussians is by far 
the most studied case; see, e.g., jElEl for surveys. Statistical work on mixtures of Gaussians has 
mainly focused on finding the distribution parameters (mixing weights, means, and variances) of 
maximum likelihood, given a set of data. Although one can write down equations whose solutions 
give these maximum likelihood values, solving the equations appears to be a computationally 
intractable problem. In particular, the most popular algorithm used for solving the equations, the
EM Algorithm of Dempster et al. [7], has no efficiency guarantees and may run slowly or converge
only to local optima on some instances.

A change in perspective led to the first provably efficient algorithm for learning: In 1999, 
Dasgupta [S] suggested learning in the clustering framework. In this scenario, the learner's goal is 
to group all the sample points according to which Gaussian in the mixture they came from. This is 
the strongest possible criterion for success one could demand; when the learner succeeds, it can easily 
recover accurate approximations of all parameters of the mixture distribution. However, a strong 
assumption is required to get such a strong outcome: it is clear that the learner cannot possibly 
succeed unless the Gaussians are guaranteed to be sufficiently "separated" in space. Informally, it 
must at least be the case that, with high probability, no sample point "looks like" it might have 
come from a different Gaussian in the mixture other than the one that actually generated it. 

Dasgupta gave a polynomial time algorithm that could cluster a mixture of spherical Gaussians
of equal radius. His algorithm required separation on the order of $n^{1/2}$ times the standard
deviation. This was improved to $n^{1/4}$ by Dasgupta and Schulman, and this in turn was significantly
generalized to the case of completely general (i.e., elliptical) Gaussians by Arora and Kannan.
Another breakthrough came from Vempala and Wang [19], who showed how the separation could
be reduced, in the case of mixtures of $k$ spherical Gaussians (of different radii), to the order of
$k^{1/4}$ times the standard deviation, times factors logarithmic in $n$. This result was extended to
mixtures of general Gaussians (indeed, log-concave distributions) in works by Kannan et al. [12]
and Achlioptas and McSherry, with some slightly worse separation requirements. It should also
be mentioned that these results all have a running time dependence that is polynomial in
$1/\pi_{\min}$, where $\pi_{\min}$ denotes the minimum mixing weight.

Our work gives another learning perspective that allows us to deal with mixtures of Gaussians 
that satisfy no separation assumption. In this case clustering is simply not possible; for any data 
set, there may be many different mixtures of Gaussians under which the data are plausible. This 
possibility also leads to the seeming intractability of finding the maximum likelihood mixture of 
Gaussians. Nevertheless, we feel that this case is both interesting and important, and that under 
these circumstances identifying some mixture of Gaussians which is statistically indistinguishable 
from the true mixture is a worthy task. This is precisely what the PAC-style learning scenario we 
work in requires, and what our main algorithm efficiently achieves. 

Reminding the reader that they work in significantly different scenarios, we end this section 
with a comparison between other aspects of our algorithm and algorithms in the clustering model. 
Our algorithm works for mixtures of axis-aligned Gaussians. This is stronger than the case of 
spherical Gaussians considered in [Sj E] i but weaker than the case of general Gaussians handled 
in |21Em- On the other hand, in Section[7|we discuss the fact that our methods should be readily 
adaptable to mixtures of a wide variety of discrete and continuous distributions — essentially, any 
distribution where the "method of moments" from statistics succeeds. The clustering algorithms 
discussed have polynomial running time dependence on k, the number of mixture components, 
whereas our algorithm's running time is polynomial in $n$ only if $k$ is a constant. We note that
in [8], strong evidence was given that (for the PAC-style learning problem that we consider) such a
dependence is unavoidable, at least in the case of learning mixtures of product distributions on the
Boolean cube. Finally, unlike the clustering algorithms mentioned, our algorithm has no running
time dependence on $1/\pi_{\min}$.

1.3 Overview of the approach and the paper 

An important ingredient of our approach is a slight extension of the WAM algorithm, the main
technical tool introduced in [8]. The algorithm takes as input a parameter $\epsilon > 0$ and samples
from an unknown mixture $Z$ of $k$ product distributions $X^1,\dots,X^k$ over $\mathbb{R}^n$. The output
of the algorithm is a list of candidate descriptions of the $k$ mixing weights and $kn$ coordinate
means of the distributions $X^1,\dots,X^k$. Roughly speaking, the guarantee for the algorithm proved
in [8] is that with high probability at least one of the candidate descriptions that the algorithm
outputs is "good" in the following sense: it is an additive $\epsilon$-accurate approximation to each of
the $k$ true mixing weights $\pi^1,\dots,\pi^k$ and to each of the true coordinate means
$\mu^i_j = \mathbf{E}[X^i_j]$ for which the corresponding mixing weight $\pi^i$ is not too small. We give
a precise specification in Section 3.

As described above, when WAM is run on a mixture distribution it generates candidate estimates
of mixing weights and means. However, to describe a Gaussian we need not only its mean
but also its variance. To achieve this we run WAM twice, once on $Z$ and once on what might be
called "$Z^2$"; i.e., for the second run, each time a draw $(z_1,\dots,z_n)$ is obtained from $Z$ we
convert it to $(z_1^2,\dots,z_n^2)$ and use that instead. It is easy to see that $Z^2$ corresponds to a
mixture of the distributions $(X^1)^2,\dots,(X^k)^2$, and thus this second run gives us estimates of
the mixing weights (again) and also of the coordinate second moments $\mathbf{E}[(X^i_j)^2]$. Having thus
run WAM twice, we essentially take the "cross-product" of the two output lists to obtain a list of
candidate descriptions, each of which specifies mixing weights, means, and second moments of the
component Gaussians. In Section 4 we give a detailed description of this process and prove that
with high probability at least one of the resulting candidates is a "good" description (in the sense
of the preceding paragraph) of the mixing weights, coordinate means, and coordinate variances of
the Gaussians $X^1,\dots,X^k$.
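The squaring trick described above can be illustrated on the simplest possible case. The sketch below does not implement WAM; as an assumption made purely for illustration it uses a single axis-aligned Gaussian (so $k = 1$) and plain empirical averages, and shows how coordinate means of the draws together with coordinate means of the squared draws yield variance estimates via $\mathrm{Var}[X] = \mathbf{E}[X^2] - \mathbf{E}[X]^2$. All names and numbers are hypothetical.

```python
import random

def coordinate_means(points):
    """Per-coordinate empirical mean of a list of equal-length points."""
    m = len(points)
    n = len(points[0])
    return [sum(p[j] for p in points) / m for j in range(n)]

rng = random.Random(1)
true_means = [2.0, -1.0]
true_stds = [0.5, 3.0]

# Draws (z_1, ..., z_n) from the original distribution.
points = [[rng.gauss(mu, s) for mu, s in zip(true_means, true_stds)]
          for _ in range(50000)]
# The "squared" draws (z_1^2, ..., z_n^2) used for the second run.
squared_points = [[z * z for z in p] for p in points]

first_moments = coordinate_means(points)            # estimates of E[X_j]
second_moments = coordinate_means(squared_points)   # estimates of E[X_j^2]

# Variance estimate per coordinate: second moment minus squared mean.
variance_estimates = [s2 - m1 * m1
                      for m1, s2 in zip(first_moments, second_moments)]
```

For a genuine mixture the per-component moments are not directly observable, which is exactly why the paper needs WAM rather than simple averaging.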

To actually PAC learn the distribution $Z$, we must find this good description among the
candidates in the list. A natural idea is to apply some sort of maximum likelihood procedure.
However, to make this work, we need to guarantee that the list contains a distribution that is close
to the target in the sense of KL divergence. Thus, in Section 5 we show how to convert each
"parametric" candidate description into a mixture of Gaussians such that any additively accurate
description indeed becomes a mixture distribution with close KL divergence to the unknown target.
(This procedure also guarantees that the candidate distributions satisfy some other technical
conditions that are needed by the maximum likelihood procedure.) Finally, in Section 6 we put the
pieces together and show how a maximum likelihood procedure can be used to identify a hypothesis
mixture of Gaussians that has small KL divergence relative to the target mixture.
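As a rough illustration of the final selection step, the following Python sketch picks, from a list of candidate mixtures, the one that assigns the highest total log-likelihood to a sample. It is a minimal sketch of generic maximum likelihood selection, not the paper's procedure from Section 6 (which also has to handle points outside a bounded range); the candidate format `(weights, means, variances)` and all numeric values are assumptions made for this example.

```python
import math
import random

def log_pdf_mixture(x, weights, means, variances):
    """Log-density at x of a mixture of axis-aligned Gaussians.
    Assumes every weight and variance is strictly positive."""
    logs = []
    for w, mu, var in zip(weights, means, variances):
        lp = math.log(w)
        for xj, m, v in zip(x, mu, var):
            lp += -0.5 * math.log(2.0 * math.pi * v) - (xj - m) ** 2 / (2.0 * v)
        logs.append(lp)
    top = max(logs)  # log-sum-exp for numerical stability
    return top + math.log(sum(math.exp(l - top) for l in logs))

def pick_max_likelihood(candidates, sample):
    """Return the candidate with the largest total log-likelihood on sample."""
    def total_log_likelihood(c):
        weights, means, variances = c
        return sum(log_pdf_mixture(x, weights, means, variances) for x in sample)
    return max(candidates, key=total_log_likelihood)

def draw(candidate, rng):
    weights, means, variances = candidate
    i = rng.choices(range(len(weights)), weights=weights, k=1)[0]
    return [rng.gauss(m, math.sqrt(v)) for m, v in zip(means[i], variances[i])]

# Hypothetical target and one clearly wrong candidate.
rng = random.Random(2)
target = ([0.5, 0.5], [[0.0, 0.0], [1.0, 1.0]], [[1.0, 1.0], [1.0, 1.0]])
far = ([0.5, 0.5], [[4.0, 4.0], [-3.0, 2.0]], [[1.0, 1.0], [1.0, 1.0]])
sample = [draw(target, rng) for _ in range(2000)]
best = pick_max_likelihood([far, target], sample)
```

Note that the target overlaps heavily with itself (means only one standard deviation apart), so no clustering of the sample is attempted or needed here.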

Note. This is the full version of [5] which contains all proofs omitted in that conference version 
because of space limitations. 

2 Preliminaries 

The PAC learning framework for probability distributions. We work in the Probably
Approximately Correct model of learning probability distributions which was proposed by Kearns
et al. [13]. In this framework the learning algorithm is given access to samples drawn from the target
distribution $Z$ to be learned, and the learning algorithm must (with high probability) output an
accurate approximation $Z'$ of the target distribution $Z$. Following [13] we use the Kullback-Leibler
(KL) divergence (also known as the relative entropy) as our notion of distance. The KL divergence
between distributions $Z$ and $Z'$ is

$$\mathrm{KL}(Z \| Z') := \int Z(x) \ln\big(Z(x)/Z'(x)\big)\, dx$$

where here we have identified the distributions with their pdfs. The reader is reminded that KL
divergence is not symmetric and is thus not a metric. KL divergence is a stringent measure of
the distance between probability distributions. In particular, it holds [3] that
$0 \le \|Z - Z'\|_1 \le (2 \ln 2)\sqrt{\mathrm{KL}(Z \| Z')}$, where $\| \cdot \|_1$ denotes total variation
distance; hence if the KL divergence is small then so is the total variation distance.
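For Gaussians the KL divergence has a well-known closed form, which makes the asymmetry noted above easy to see numerically. The sketch below uses the standard formula $\mathrm{KL}(N(\mu_p,\sigma_p^2) \| N(\mu_q,\sigma_q^2)) = \tfrac12\big(\ln(\sigma_q^2/\sigma_p^2) + (\sigma_p^2 + (\mu_p-\mu_q)^2)/\sigma_q^2 - 1\big)$ and sums it over coordinates for axis-aligned Gaussians, using additivity of KL divergence over product distributions. The function names are hypothetical, and the formula is a textbook fact rather than something stated in this section.

```python
import math

def kl_gauss_1d(mu_p, var_p, mu_q, var_q):
    """KL(P || Q) in nats for one-dimensional Gaussians P and Q."""
    return 0.5 * (math.log(var_q / var_p)
                  + (var_p + (mu_p - mu_q) ** 2) / var_q
                  - 1.0)

def kl_axis_aligned(means_p, vars_p, means_q, vars_q):
    """KL between axis-aligned Gaussians: the sum of the coordinate-wise KLs."""
    return sum(kl_gauss_1d(mp, vp, mq, vq)
               for mp, vp, mq, vq in zip(means_p, vars_p, means_q, vars_q))

same = kl_gauss_1d(0.0, 1.0, 0.0, 1.0)          # identical distributions
shifted = kl_gauss_1d(0.0, 1.0, 1.0, 1.0)       # unit shift, unit variance
forward = kl_gauss_1d(0.0, 1.0, 0.0, 4.0)       # narrow P against wide Q
backward = kl_gauss_1d(0.0, 4.0, 0.0, 1.0)      # wide P against narrow Q
```

The gap between `forward` and `backward` reflects the asymmetry: a hypothesis whose variance is too small is penalized much more heavily than one whose variance is too large.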
We make the following formal definition:

Definition 1 Let $\mathcal{D}$ be a class of probability distributions over $\mathbb{R}^n$. An efficient
(proper) learning algorithm for $\mathcal{D}$ is an algorithm which, given $\epsilon, \delta > 0$ and samples
drawn from any distribution $Z \in \mathcal{D}$, runs in poly$(n, 1/\epsilon, 1/\delta)$ time and, with
probability at least $1 - \delta$, outputs a representation of a distribution $Z' \in \mathcal{D}$ such that
$\mathrm{KL}(Z \| Z') \le \epsilon$.

Mixtures of axis-aligned Gaussians. Here we recall some basic definitions and establish useful 
notational conventions for later. 

A Gaussian distribution over $\mathbb{R}$ with mean $\mu$ and variance $\sigma^2$ has probability density
function

$$f(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right).$$

An axis-aligned Gaussian over $\mathbb{R}^n$ is a product distribution over $n$ univariate Gaussians.

If we expect to learn a mixture of Gaussians, we need each Gaussian to have reasonable pa- 
rameters in each of its coordinates. Indeed, consider just the problem of learning the parameters 
of a single one-dimensional Gaussian: If the variance is enormous, we could not expect to estimate 
the mean efficiently; or, if the variance was extremely close to 0, any slight error in the hypothesis 
would lead to a severe penalty in KL divergence. These issues motivate the following definition: 

Definition 2 We say that $X$ is a $d$-dimensional $(\mu_{\max}, \sigma^2_{\min}, \sigma^2_{\max})$-bounded
Gaussian if $X$ is a $d$-dimensional axis-aligned Gaussian with the property that each of its
one-dimensional coordinate Gaussians $X_j$ has mean $\mu_j \in [-\mu_{\max}, \mu_{\max}]$ and variance
$(\sigma_j)^2 \in [\sigma^2_{\min}, \sigma^2_{\max}]$.

Notational convention: Throughout the rest of the paper all Gaussians we consider are
$(\mu_{\max}, \sigma^2_{\min}, \sigma^2_{\max})$-bounded, where for notational convenience we assume that the
numbers $\mu_{\max}$, $\sigma^2_{\max}$ are at least 1 and that the number $\sigma^2_{\min}$ is at most 1. We
will denote by $L$ the quantity $\mu_{\max}\sigma_{\max}/\sigma_{\min}$, which in some sense measures the
bit-complexity of the problem. Given distributions $X^1,\dots,X^k$ over $\mathbb{R}^n$, we write
$\mu^i_j$ to denote $\mathbf{E}[X^i_j]$, the $j$-th coordinate mean of the $i$-th component distribution,
and we write $(\sigma^i_j)^2$ to denote $\mathrm{Var}[X^i_j]$, the variance in coordinate $j$ of the $i$-th
distribution.

A mixture of $k$ axis-aligned Gaussians $Z = \pi^1 X^1 + \cdots + \pi^k X^k$ is completely specified by
the parameters $\pi^i$, $\mu^i_j$, and $(\sigma^i_j)^2$. Our learning algorithm for Gaussians will have a
running time that depends polynomially on $L$; thus the algorithm is not strongly polynomial.

3 Listing candidate weights and means with WAM 

We first recall the basic features of the WAM algorithm from [8] and then explain the extension
we require. The algorithm described in [8] takes as input a parameter $\epsilon > 0$ and samples from
an unknown mixture $Z$ of $k$ distributions $X^1,\dots,X^k$ where each $X^i = (X^i_1,\dots,X^i_n)$ is
assumed to be a product distribution over the bounded domain $[-1,1]^n$. The goal of WAM is to
output accurate estimates for the mixing weights $\pi^i$ and coordinate means $\mu^i_j$; what the
algorithm actually outputs is a list of candidate "parametric descriptions" of the means and mixing
weights, where each candidate description is of the form
$(\{\hat\pi^1,\dots,\hat\pi^k\}, \{\hat\mu^1_1, \hat\mu^1_2, \dots, \hat\mu^k_n\})$.

We now explain the notion of a "good" estimate of parameters from Section 1.3 in more detail.
As motivation, note that if a mixing weight $\pi^i$ is very low then the WAM algorithm (or indeed
any algorithm that only draws a limited number of samples from $Z$) may not receive any samples
from $X^i$, and thus we would not expect WAM to construct an accurate estimate for the coordinate
means $\mu^i_1,\dots,\mu^i_n$. We thus have the following definition from [8]:

Definition 3 A candidate $(\{\hat\pi^1,\dots,\hat\pi^k\}, \{\hat\mu^1_1, \hat\mu^1_2, \dots, \hat\mu^k_n\})$ is
said to be parametrically $\epsilon$-accurate if:

1. $|\hat\pi^i - \pi^i| \le \epsilon$ for all $1 \le i \le k$;

2. $|\hat\mu^i_j - \mu^i_j| \le \epsilon$ for all $1 \le i \le k$ and $1 \le j \le n$ such that $\pi^i \ge \epsilon$.
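The definition above translates directly into a small checker. The following Python sketch is only an aid to reading the definition: it assumes the true parameters are known (which a learner of course does not have), and the function name and example values are hypothetical.

```python
def is_parametrically_accurate(cand_weights, cand_means,
                               true_weights, true_means, eps):
    """Check the two conditions of parametric eps-accuracy.

    cand_means[i][j] and true_means[i][j] are the j-th coordinate means of
    component i. Means of components whose true weight is below eps are
    not constrained."""
    for i in range(len(true_weights)):
        if abs(cand_weights[i] - true_weights[i]) > eps:
            return False
        if true_weights[i] >= eps:
            for cand_mu, true_mu in zip(cand_means[i], true_means[i]):
                if abs(cand_mu - true_mu) > eps:
                    return False
    return True

true_weights = [0.99, 0.01]
true_means = [[0.0, 1.0], [5.0, 5.0]]

# The light component's means are badly wrong, but its weight is below eps,
# so the candidate still counts as accurate.
light_wrong = is_parametrically_accurate(
    [0.97, 0.03], [[0.01, 1.02], [-9.0, 9.0]], true_weights, true_means, 0.05)

# An error in the heavy component's mean is not forgiven.
heavy_wrong = is_parametrically_accurate(
    [0.97, 0.03], [[0.5, 1.0], [5.0, 5.0]], true_weights, true_means, 0.05)
```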

Very roughly speaking, the WAM algorithm in [8] works by exhaustively "guessing" (to a
certain prescribed granularity that depends on $\epsilon$) values for the mixing weights and for $k^2$ of
the $kn$ coordinate means. Given a guess, the algorithm tries to approximately solve for the remaining
$k(n-k)$ coordinate means using the guessed values and the sample data; in the course of doing
this the algorithm uses estimates of the expectations $\mathbf{E}[Z_j Z_{j'}]$ that are obtained from the
sample data. From each guess the algorithm thus obtains one of the candidates in the list that it
ultimately outputs.

The assumption in [8] that each distribution $X^i$ in the mixture is over $[-1,1]^n$ has two nice
consequences: each coordinate mean need only be guessed within a bounded domain $[-1,1]$, and
estimating $\mathbf{E}[Z_j Z_{j'}]$ is easy for a mixture $Z$ of such distributions. Inspection of the proof of
correctness of the WAM algorithm shows that these two conditions are all that is really required.
We thus introduce the following:

Definition 4 Let $X$ be a distribution over $\mathbb{R}$. We say that $X$ is $\lambda(\epsilon, \delta)$-samplable
if there is an algorithm $A$ which, given access to draws from $X$, runs for $\lambda(\epsilon, \delta)$ steps and
outputs (with probability at least $1 - \delta$ over the draws from $X$) a quantity $\hat\mu$ satisfying
$|\hat\mu - \mathbf{E}[X]| \le \epsilon$.
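To make the notion of samplability concrete, here is one generic way such an estimator could look. This is an assumption for illustration only: the median-of-means estimator below is a standard technique for obtaining a $\log(1/\delta)$ dependence on the confidence parameter, and nothing in this section says it is the method the paper uses. The random variable being estimated, a product of two independent Gaussian coordinates, and all parameter values are hypothetical.

```python
import random
import statistics

def median_of_means(draw, batch_size, num_batches):
    """Estimate E[X] by averaging within batches and taking the median of
    the batch averages. More batches give higher confidence."""
    batch_means = []
    for _ in range(num_batches):
        total = sum(draw() for _ in range(batch_size))
        batch_means.append(total / batch_size)
    return statistics.median(batch_means)

rng = random.Random(3)

def draw_w():
    """One draw of W = Z_1 * Z_2 with independent Z_1 ~ N(1, 1), Z_2 ~ N(2, 1)."""
    return rng.gauss(1.0, 1.0) * rng.gauss(2.0, 1.0)

# E[W] = E[Z_1] * E[Z_2] = 2 by independence.
estimate = median_of_means(draw_w, batch_size=2000, num_batches=9)
```

In such a scheme the batch size would be chosen from the target accuracy and a variance bound, and the number of batches from the confidence parameter.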

With this definition in hand an obvious (slight) generalization of WAM, which we denote
WAM', suggests itself. The main result about WAM' that we need is the following (the proof is
essentially identical to the proof in [8], so we omit it):

Theorem 2 Let $Z$ be a mixture of product distributions $X^1,\dots,X^k$ with mixing weights
$\pi^1,\dots,\pi^k$ where each $\mu^i_j = \mathbf{E}[X^i_j]$ satisfies $|\mu^i_j| \le U$ and $Z_j Z_{j'}$ is
poly$(U/\epsilon) \cdot \log(1/\delta)$-samplable for all $j \ne j'$. Given $U$ and any $\epsilon, \delta > 0$,
WAM' runs in time $(nU/\epsilon)^{O(k^3)} \cdot \log(1/\delta)$ and outputs a list of $(nU/\epsilon)^{O(k^3)}$
many candidate descriptions, at least one of which (with probability at least $1 - \delta$) is
parametrically $\epsilon$-accurate.

4 Listing candidate weights, means, and variances 

Through the rest of the paper we assume that $Z$ is a $k$-wise mixture of independent
$(\mu_{\max}, \sigma^2_{\min}, \sigma^2_{\max})$-bounded Gaussians $X^1,\dots,X^k$, as discussed in Section 2.
Recall also the notation $L$ from that section.

As described in Section 1.3 we will run WAM' twice, once on the original mixture of Gaussians
$Z$ and once on the squared mixture $Z^2$. In order to do this, we must show that both
$Z = \pi^1 X^1 + \cdots + \pi^k X^k$ and $Z^2 = \pi^1 (X^1)^2 + \cdots + \pi^k (X^k)^2$ satisfy the conditions
of Theorem 2. The bound $|\mu^i_j| \le \mu_{\max}$ on coordinate means is satisfied by assumption for $Z$,
and for $Z^2$ we have that each $\mathbf{E}[(X^i_j)^2]$ is at most $\sigma^2_{\max} + \mu^2_{\max}$. It remains to
verify the required samplability condition on products of two coordinates for both $Z$ and $Z^2$; i.e.,
we must show that both the random variables $Z_j Z_{j'}$ are samplable and that the random variables
$Z_j^2 Z_{j'}^2$ are samplable. We do this in the following proposition, whose straightforward but
technical proof appears in Appendix lEl

Proposition 1 Suppose $Z = (Z_1, Z_2)$ is the mixture of $k$ two-dimensional
$(\mu_{\max}, \sigma^2_{\min}, \sigma^2_{\max})$-bounded Gaussians. Then both the random variable
$W := Z_1 Z_2$ and the random variable $W^2$ are poly$(L/\epsilon) \cdot \log(1/\delta)$-samplable.

The proof of the following theorem explains precisely how we can run WAM' twice and how
we can combine the two resulting lists (one containing candidate descriptions consisting of mixing 
weights and coordinate means, the other containing candidate descriptions consisting of mixing 
weights and coordinate second moments) to obtain a single list of candidate descriptions consisting 
of mixing weights, coordinate means, and coordinate variances. 

Theorem 3 Let $Z$ be a mixture of $k = O(1)$ axis-aligned Gaussians $X^1,\dots,X^k$ over
$\mathbb{R}^n$, described by parameters $(\{\pi^i\}, \{\mu^i_j\}, \{\sigma^i_j\})$. There is an algorithm with
the following property: For any $\epsilon, \delta > 0$, given samples from $Z$ the algorithm runs in
poly$(nL/\epsilon) \cdot \log(1/\delta)$ time and with probability $1 - \delta$ outputs a list of
poly$(nL/\epsilon)$ many candidates $(\{\hat\pi^i\}, \{\hat\mu^i_j\}, \{\hat\sigma^i_j\})$ such that for at least
one candidate in the list, the following holds:

1. $|\hat\pi^i - \pi^i| \le \epsilon$ for all $i \in [k]$; and

2. $|\hat\mu^i_j - \mu^i_j| \le \epsilon$ and $|(\hat\sigma^i_j)^2 - (\sigma^i_j)^2| \le \epsilon$ for all $i, j$
such that $\pi^i \ge \epsilon$.

Proof: First run the algorithm WAM' with the random variable $Z$, taking the parameter "$U$"
in WAM' to be $L$, taking "$\delta$" to be $\delta/2$, and taking "$\epsilon$" to be
$\epsilon/(6\mu_{\max})$. By Proposition 1 and Theorem 2 this takes at most the claimed running time.
WAM' outputs a list List1 of candidate descriptions for the mixing weights and expectations,
List1 $= [\dots, (\hat\pi^i, \hat\mu^i_j), \dots]$, which with probability at least $1 - \delta/2$ contains at
least one candidate description which is parametrically $\epsilon/(6\mu_{\max})$-accurate.

Define $(s^i_j)^2 = \mathbf{E}[(X^i_j)^2] = (\sigma^i_j)^2 + (\mu^i_j)^2$. Run the algorithm WAM' again on
the squared random variable $Z^2$, with "$U$" $= \sigma^2_{\max} + \mu^2_{\max}$, "$\delta$" $= \delta/2$, and
"$\epsilon$" $= \epsilon/2$. By Proposition 1 this again takes at most the claimed running time. This
time WAM' outputs a list List2 of candidates for the mixing weights (again) and second moments,
List2 $= [\dots, (\tilde\pi^i, (\hat s^i_j)^2), \dots]$, which with probability at least $1 - \delta/2$ has a
"good" entry which satisfies

1. $|\tilde\pi^i - \pi^i| \le \epsilon/2$ for all $i = 1 \dots k$; and

2. $|(\hat s^i_j)^2 - (s^i_j)^2| \le \epsilon/2$ for all $i, j$ such that $\pi^i \ge \epsilon/2$.

We now form the "cross product" of the two lists. (Again, this can be done in the claimed
running time.) Specifically, for each pair consisting of a candidate $(\hat\pi^i, \hat\mu^i_j)$ in List1 and
a candidate $(\tilde\pi^i, (\hat s^i_j)^2)$ in List2, we form a new candidate consisting of mixing weights,
means, and variances, namely $(\hat\pi^i, \hat\mu^i_j, (\hat\sigma^i_j)^2)$ where
$(\hat\sigma^i_j)^2 = (\hat s^i_j)^2 - (\hat\mu^i_j)^2$. (Note that we simply discard $\tilde\pi^i$.)

When the "good" candidate from List1 is matched with the "good" candidate from List2, the
resulting candidate's mixing weights and means satisfy the desired bounds. For the variances, we
have that $|(\hat\sigma^i_j)^2 - (\sigma^i_j)^2|$ is at most

$$|(\hat s^i_j)^2 - (s^i_j)^2| + |(\hat\mu^i_j)^2 - (\mu^i_j)^2|
\le \frac{\epsilon}{2} + |\hat\mu^i_j - \mu^i_j| \cdot |\hat\mu^i_j + \mu^i_j|
\le \frac{\epsilon}{2} + \frac{\epsilon}{6\mu_{\max}} \cdot 3\mu_{\max} = \epsilon.$$

This proves the theorem. ■
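The cross-product step in the proof can be sketched as follows. The data layout is an assumption made for this illustration: each List1 entry is a pair `(weights, means)` and each List2 entry is a pair `(weights, second_moments)`, with `means[i][j]` and `second_moments[i][j]` indexed by component `i` and coordinate `j`. The function name and the toy values are hypothetical.

```python
from itertools import product

def cross_product(list1, list2):
    """Combine every (weights, means) candidate with every
    (weights, second_moments) candidate. The weights from list2 are
    discarded; variances are second moment minus squared mean."""
    combined = []
    for (weights, means), (_unused_weights, second) in product(list1, list2):
        variances = [[s - m * m for m, s in zip(mean_row, second_row)]
                     for mean_row, second_row in zip(means, second)]
        combined.append((weights, means, variances))
    return combined

# Toy lists for a single component (k = 1) in two dimensions.
list1 = [([1.0], [[1.0, 2.0]]), ([1.0], [[0.0, 0.0]])]
list2 = [([1.0], [[1.5, 5.0]]), ([1.0], [[2.0, 6.0]]), ([1.0], [[0.1, 0.1]])]
candidates = cross_product(list1, list2)
```

A mismatched pair can yield a negative "variance" (for example the first mean candidate with the third second-moment candidate above); such values are harmless at this stage because the conversion procedure of Section 5 truncates out-of-range parameters.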

5 From parametric estimates to bona fide distributions 

At this point we have a list of candidate "parametric" descriptions
$(\{\hat\pi^i\}, \{\hat\mu^i_j\}, \{(\hat\sigma^i_j)^2\})$ of mixtures of Gaussians, at least one of which is
parametrically accurate in the sense of Theorem 3. In Section 5.1 we describe an efficient way to
convert any parametric description into a true mixture of Gaussians such that:

(i) any parametrically accurate description becomes a distribution with close KL divergence to
the target distribution; and

(ii) every mixture distribution that results from the conversion has a pdf that satisfies certain
upper and lower bounds (that will be required for the maximum likelihood procedure).

The conversion procedure is conceptually straightforward (it essentially just truncates any
extreme parameters to put them in a "reasonable" range), but the details establishing correctness
are fairly technical. By applying this conversion to each of the parametric descriptions in our list
from Section 4 we obtain a list of mixture distribution hypotheses all of which have bounded pdfs
and at least one of which is close to the target $Z$ in KL divergence (see Section 5.2). With such a
list in hand, we will be able to use maximum likelihood (in Section 6) to identify a single hypothesis
which is close in KL divergence.

5.1 The conversion procedure 

In this section we prove: 



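Theorem 4 There is a simple efficient procedure $A$ which takes values
$(\{\hat\pi^i\}, \{\hat\mu^i_j\}, \{\hat\sigma^i_j\})$ and a value $M > \mu_{\max}$ as inputs and outputs a true
mixture $\dot Z$ of $k$ many $n$-dimensional $(\mu_{\max}, \sigma^2_{\min}, \sigma^2_{\max})$-bounded Gaussians
with mixing weights $\dot\pi^1,\dots,\dot\pi^k$ satisfying

(a) $\sum_{i=1}^k \dot\pi^i = 1$, and

(b) $\alpha_0 \le \dot Z(x) \le \beta_0$ for all $x \in [-M, M]^n$,

where
$$\alpha_0 := \left(\frac{1}{\sqrt{2\pi}\,\sigma_{\max}} \cdot \exp\left(-\frac{2M^2}{\sigma^2_{\min}}\right)\right)^n
\quad\text{and}\quad
\beta_0 := \left(\frac{1}{\sqrt{2\pi}\,\sigma_{\min}}\right)^n.$$

Furthermore, suppose $Z$ is a mixture of Gaussians $X^1,\dots,X^k$ with mixing weights $\pi^i$, means
$\mu^i_j$, and variances $(\sigma^i_j)^2$ and that the following are satisfied:

(c) for $i = 1 \dots k$ we have $|\pi^i - \hat\pi^i| \le \epsilon_{\mathrm{wts}}$ where
$\epsilon_{\mathrm{wts}} \le 1/(12k)^3$; and

(d) for all $i, j$ such that $\pi^i \ge \epsilon_{\mathrm{minwt}}$ we have
$|\mu^i_j - \hat\mu^i_j| \le \epsilon_{\mathrm{means}}$ and
$|(\sigma^i_j)^2 - (\hat\sigma^i_j)^2| \le \epsilon_{\mathrm{vars}}$.

Then $\dot Z$ will satisfy
$\mathrm{KL}(Z \| \dot Z) \le \eta(\epsilon_{\mathrm{means}}, \epsilon_{\mathrm{vars}}, \epsilon_{\mathrm{wts}}, \epsilon_{\mathrm{minwt}})$,
where

$$\eta(\epsilon_{\mathrm{means}}, \epsilon_{\mathrm{vars}}, \epsilon_{\mathrm{wts}}, \epsilon_{\mathrm{minwt}}) :=
n \cdot \left(\frac{\epsilon_{\mathrm{vars}}}{2\sigma^2_{\min}}
+ \frac{\epsilon^2_{\mathrm{means}} + \epsilon_{\mathrm{vars}}}{2(\sigma^2_{\min} - \epsilon_{\mathrm{vars}})}\right)
+ k\,\epsilon_{\mathrm{minwt}} \cdot n \cdot \left(\frac{\sigma^2_{\max} + 2\mu^2_{\max}}{\sigma^2_{\min}}\right)
+ 13k\,\epsilon_{\mathrm{wts}}^{1/3}.$$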



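Proof: We construct a mixture $\dot Z$ of product distributions $\dot X^1,\dots,\dot X^k$ by defining
new mixing weights $\dot\pi^i$, expectations $\dot\mu^i_j$, and variances $(\dot\sigma^i_j)^2$. The procedure
$A$ is defined as follows:

1. For all $i, j$, set

$$\dot\mu^i_j = \begin{cases}
-\mu_{\max} & \text{if } \hat\mu^i_j < -\mu_{\max} \\
\mu_{\max} & \text{if } \hat\mu^i_j > \mu_{\max} \\
\hat\mu^i_j & \text{o.w.}
\end{cases}
\qquad\text{and}\qquad
\dot\sigma^i_j = \begin{cases}
\sigma_{\min} & \text{if } \hat\sigma^i_j < \sigma_{\min} \\
\sigma_{\max} & \text{if } \hat\sigma^i_j > \sigma_{\max} \\
\hat\sigma^i_j & \text{o.w.}
\end{cases}$$

2. For all $i = 1,\dots,k$ let

$$\ddot\pi^i = \begin{cases}
\hat\pi^i & \text{if } \hat\pi^i \ge \epsilon_{\mathrm{wts}} \\
\epsilon_{\mathrm{wts}} & \text{if } \hat\pi^i < \epsilon_{\mathrm{wts}}.
\end{cases}$$

Let $s$ be such that $s \sum_{i=1}^k \ddot\pi^i = 1$. Take $\dot\pi^i = s\,\ddot\pi^i$. (This is just a
normalization so the mixing weights sum to precisely 1.)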
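The two steps of the procedure can be sketched in Python as below. The sketch works with variances rather than standard deviations, clamping them to $[\sigma^2_{\min}, \sigma^2_{\max}]$, which is equivalent to clamping standard deviations when they are nonnegative and also handles a negative variance estimate; this choice, the function name, and the example values are assumptions made for illustration.

```python
def convert_candidate(weights, means, variances,
                      mu_max, var_min, var_max, eps_wts):
    """Truncate a parametric description into a bounded mixture.

    Step 1: clamp every mean to [-mu_max, mu_max] and every variance to
            [var_min, var_max].
    Step 2: raise every weight below eps_wts up to eps_wts, then rescale
            so that the weights sum to 1."""
    def clamp(value, low, high):
        return max(low, min(high, value))

    new_means = [[clamp(m, -mu_max, mu_max) for m in row] for row in means]
    new_vars = [[clamp(v, var_min, var_max) for v in row] for row in variances]

    floored = [max(w, eps_wts) for w in weights]
    scale = 1.0 / sum(floored)
    new_weights = [scale * w for w in floored]
    return new_weights, new_means, new_vars

# Hypothetical out-of-range candidate: a zero weight, a huge mean,
# and a negative variance estimate.
w, mu, var = convert_candidate(
    weights=[0.0, 1.1],
    means=[[100.0, 0.5], [-0.2, -7.0]],
    variances=[[-0.3, 0.5], [50.0, 1.0]],
    mu_max=5.0, var_min=0.1, var_max=10.0, eps_wts=0.01)
```

After conversion every weight is strictly positive, which is what makes the pdf lower bound in condition (b) possible.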

It is clear from this construction that condition (a) is satisfied. For (b), the bounds on
$\dot\sigma^i_j$ are easily seen to imply that $\dot X^i(x) \le 1/(\sqrt{2\pi}\,\sigma_{\min})^n =: \beta_0$ for
all $x \in \mathbb{R}^n$, and hence the same upper bound holds for the mixture $\dot Z(x)$, being a
convex combination of the values $\dot X^i(x)$. Similarly, using the fact that $M > \mu_{\max}$ together
with the bounds on $\dot\mu^i_j$ and $\dot\sigma^i_j$, we have that
$\dot X^i(x) \ge \left(\frac{1}{\sqrt{2\pi}\,\sigma_{\max}} \cdot \exp\left(-\frac{2M^2}{\sigma^2_{\min}}\right)\right)^n =: \alpha_0$
for all $x \in [-M, M]^n$, and this lower bound holds for $\dot Z(x)$ as well.

We now prove the second half of the theorem; so suppose that conditions (c) and (d) hold. Our
goal is to apply the following proposition (proved in [8]) to bound $\mathrm{KL}(Z \| \dot Z)$:

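Proposition 2 Let $\pi^1,\dots,\pi^k, \gamma^1,\dots,\gamma^k \ge 0$ be mixing weights satisfying
$\sum \pi^i = \sum \gamma^i = 1$. Let $\mathcal{I} = \{i : \pi^i \ge \epsilon_3\}$. Let $P^1,\dots,P^k$ and
$Q^1,\dots,Q^k$ be distributions. Suppose that

1. $|\pi^i - \gamma^i| \le \epsilon_1$ for all $i \in [k]$;

2. $\gamma^i \ge \epsilon_2$ for all $i \in [k]$;

3. $\mathrm{KL}(P^i \| Q^i) \le \epsilon_{\mathcal{I}}$ for all $i \in \mathcal{I}$;

4. $\mathrm{KL}(P^i \| Q^i) \le \epsilon_{\mathrm{all}}$ for all $i \in [k]$.

Then, letting $P$ denote the $\pi$-mixture of the $P^i$'s and $Q$ the $\gamma$-mixture of the $Q^i$'s, for
any $\epsilon_4 > \epsilon_1$ we have

$$\mathrm{KL}(P \| Q) \le \epsilon_{\mathcal{I}} + k\,\epsilon_3\,\epsilon_{\mathrm{all}}
+ k\,\epsilon_4 \ln\frac{\epsilon_4}{\epsilon_2} + \frac{\epsilon_1}{\epsilon_4 - \epsilon_1}.$$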



More precisely, our goal is to apply this proposition with parameters 
ei = 3/cewts; €2 = ewts/2; eg = e^inwt; ex = n ■ (j^ + f^JT^-ZZ) ) ' = ' 



max I ^^^max 

— ^3 



mm 



To satisfy the conditions of the proposition, we must (1) upper bound $|\pi^i - \dot\pi^i|$ for all $i$;
(2) lower bound $\dot\pi^i$ for all $i$; (3) upper bound $\mathrm{KL}(X^i \| \dot X^i)$ for all $i$ such that
$\pi^i \ge \epsilon_{\mathrm{minwt}}$; and (4) upper bound $\mathrm{KL}(X^i \| \dot X^i)$ for all $i$. We now do this.

(1) Upper bounding $|\pi^i - \dot\pi^i|$. A straightforward argument given in [8] shows that assuming
$\epsilon_{\mathrm{wts}} \le 1/(2k)$, we get $|\pi^i - \dot\pi^i| \le 3k\,\epsilon_{\mathrm{wts}}$.

(2) Lower bounding $\dot\pi^i$. In [8] it is also shown that $\dot\pi^i \ge \epsilon_{\mathrm{wts}}/2$ assuming that $\epsilon_{\mathrm{wts}} \le 1/(2k)$.

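(3) Upper bounding $\mathrm{KL}(X^i \| \dot X^i)$ for all $i$ such that $\pi^i \ge \epsilon_{\mathrm{minwt}}$.
Fix an $i$ such that $\pi^i \ge \epsilon_{\mathrm{minwt}}$ and fix any $j \in [n]$. Consider some particular
$\mu^i_j$ and $\hat\mu^i_j$ and $\sigma^i_j$ and $\hat\sigma^i_j$, so we have
$|\mu^i_j - \hat\mu^i_j| \le \epsilon_{\mathrm{means}}$ and
$|(\sigma^i_j)^2 - (\hat\sigma^i_j)^2| \le \epsilon_{\mathrm{vars}}$. Since $|\mu^i_j| \le \mu_{\max}$, by the
definition of $\dot\mu^i_j$ we have that $|\mu^i_j - \dot\mu^i_j| \le \epsilon_{\mathrm{means}}$, and likewise we
have $|(\sigma^i_j)^2 - (\dot\sigma^i_j)^2| \le \epsilon_{\mathrm{vars}}$. Let $P$ and $Q$ be the one-dimensional
Gaussians with means $\mu^i_j$ and $\dot\mu^i_j$ and variances $(\sigma^i_j)^2$ and $(\dot\sigma^i_j)^2$
respectively. By Corollary 1 we have

$$\mathrm{KL}(P \| Q) \le \frac{\epsilon_{\mathrm{vars}}}{2\sigma^2_{\min}}
+ \frac{\epsilon^2_{\mathrm{means}} + \epsilon_{\mathrm{vars}}}{2(\sigma^2_{\min} - \epsilon_{\mathrm{vars}})}.$$

Each $X^i$ is the product of $n$ such Gaussians. Since KL divergence is additive for product
distributions (see Proposition |SJ), we have the following bound, for each $i$ such that
$\pi^i \ge \epsilon_{\mathrm{minwt}}$:

$$\mathrm{KL}(X^i \| \dot X^i) \le n \cdot \left(\frac{\epsilon_{\mathrm{vars}}}{2\sigma^2_{\min}}
+ \frac{\epsilon^2_{\mathrm{means}} + \epsilon_{\mathrm{vars}}}{2(\sigma^2_{\min} - \epsilon_{\mathrm{vars}})}\right).$$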

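(4) Upper bounding $\mathrm{KL}(X^i \| \dot X^i)$ for all $i \in [k]$. Using the fact that both $X^i$ and
$\dot X^i$ are $(\mu_{\max}, \sigma^2_{\min}, \sigma^2_{\max})$-bounded, it follows from Fact El and
Proposition [S] that we have

$$\mathrm{KL}(X^i \| \dot X^i) \le n \cdot \frac{\sigma^2_{\max} + 2\mu^2_{\max}}{\sigma^2_{\min}}.$$

Proposition 2 now gives us

$$\mathrm{KL}(Z \| \dot Z) \le n \cdot \left(\frac{\epsilon_{\mathrm{vars}}}{2\sigma^2_{\min}}
+ \frac{\epsilon^2_{\mathrm{means}} + \epsilon_{\mathrm{vars}}}{2(\sigma^2_{\min} - \epsilon_{\mathrm{vars}})}\right)
+ k\,\epsilon_{\mathrm{minwt}} \cdot n \cdot \left(\frac{\sigma^2_{\max} + 2\mu^2_{\max}}{\sigma^2_{\min}}\right) + R,$$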

where R = ke^lnf + = |e^/^ ln(e;^Y') + a/a'ft: ' ^smg the fact that Inx < x^'^ for 

a; > 1, the first of these two terms is at most fe^^g. Using the fact that e^ts < 1/(12A:)^, the second 
of these terms is at most llke]^^^. So R is at most ISfce^jg and the theorem is proved. ■ 



5.2 Getting a list of distributions one of which is KL-close to the target 

In this section we show that combining the conversion procedure from the previous subsection with
the results of Section 4 lets us obtain the following:

Theorem 5 Let $Z$ be any unknown mixture of $k = O(1)$ axis-aligned Gaussians over
$\mathbb{R}^n$. There is an algorithm with the following property: for any $\epsilon, \delta > 0$, given
samples from $Z$ the algorithm runs in poly$(nL/\epsilon) \cdot \log(1/\delta)$ time and with probability
$1 - \delta$ outputs a list of poly$(nL/\epsilon)$ many mixtures of Gaussians with the following
properties:

1. For any $M > \mu_{\max}$ such that $M = $ poly$(nL/\epsilon)$, every distribution $Z'$ in the list
satisfies $\exp(-\mathrm{poly}(nL/\epsilon)) \le Z'(x) \le \mathrm{poly}(L)^n$ for all $x \in [-M, M]^n$.

2. Some distribution $Z^*$ in the list satisfies $\mathrm{KL}(Z \| Z^*) \le \epsilon$.

Note that Theorem 5 guarantees that $Z'(x)$ has bounded mass only on the range $[-M, M]^n$,
whereas the support of $Z$ goes beyond this range. This issue is addressed in the proof of Theorem 7,
where we put together Theorem 5 and the maximum likelihood procedure.

Proof of Theorem 5. We will use a specialization of Theorem 3 in which we have different
parameters for the different roles that $\epsilon$ plays:

Theorem 3' Let $Z$ be a mixture of $k = O(1)$ axis-aligned Gaussians $X^1,\dots,X^k$ over
$\mathbb{R}^n$, described by parameters $(\{\pi^i\}, \{\mu^i_j\}, \{\sigma^i_j\})$. There is an algorithm with
the following property: for any
$\epsilon_{\mathrm{means}}, \epsilon_{\mathrm{vars}}, \epsilon_{\mathrm{wts}}, \epsilon_{\mathrm{minwt}}, \delta > 0$, given
samples from $Z$, with probability $1 - \delta$ it outputs a list of candidates
$(\{\hat\pi^i\}, \{\hat\mu^i_j\}, \{\hat\sigma^i_j\})$ such that for at least one candidate in the list, the
following holds:

1. $|\hat\pi^i - \pi^i| \le \epsilon_{\mathrm{wts}}$ for all $i \in [k]$; and

2. $|\hat\mu^i_j - \mu^i_j| \le \epsilon_{\mathrm{means}}$ and
$|(\hat\sigma^i_j)^2 - (\sigma^i_j)^2| \le \epsilon_{\mathrm{vars}}$ for all $i, j$ such that
$\pi^i \ge \epsilon_{\mathrm{minwt}}$.

The algorithm runs in time poly$(nL/\epsilon') \cdot \log(1/\delta)$ where
$\epsilon' = \min\{\epsilon_{\mathrm{wts}}, \epsilon_{\mathrm{means}}, \epsilon_{\mathrm{vars}}, \epsilon_{\mathrm{minwt}}\}$.

2 

Let e,6 > he given. We run th.6 cLlgorith.ni of Theorem with pctrameters ^means — ~r27T^' 

Cvars = 2emcans, Cminwt = 3fcra(cr,{^^+Vinax) ^ (sMF" ^^^^ ^^^^^ parameters the algorithm 

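runs in time $\mathrm{poly}(nL/\epsilon) \cdot \log(1/\delta)$. By Theorem 3′, we get as output a list of $\mathrm{poly}(nL/\epsilon)$ many
candidate parameter settings $(\{\hat\pi^i\}, \{\hat\mu^i_j\}, \{\hat\sigma^i_j\})$ with the guarantee that with probability $1 - \delta$ at
least one of the settings satisfies

• $|\hat\pi^i - \pi^i| \le \epsilon_{\mathrm{wts}}$ for all $i \in [k]$, and

• $|\hat\mu^i_j - \mu^i_j| \le \epsilon_{\mathrm{means}}$ and $|(\hat\sigma^i_j)^2 - (\sigma^i_j)^2| \le \epsilon_{\mathrm{vars}}$ for all $i, j$ such that $\pi^i \ge \epsilon_{\mathrm{minwt}}$.

We now pass each of these candidate parameter settings through Theorem 4. (Note that
$\epsilon_{\mathrm{wts}} \le 1/(12k)^3$ as required by Theorem 4.) By Theorem 4, for any $M = \mathrm{poly}(nL/\epsilon)$ all the
resulting distributions will satisfy $\exp(-\mathrm{poly}(nL/\epsilon)) \le Z'(x) \le \mathrm{poly}(L)^n$ for all $x \in [-M, M]^n$.
It is easy to check that under our parameter settings, each of the three component terms of $\eta$, namely

$$n \cdot \left( \frac{\epsilon_{\mathrm{vars}}}{2\sigma_{\min}^2} + \frac{\epsilon_{\mathrm{means}}^2 + \epsilon_{\mathrm{vars}}}{2(\sigma_{\min}^2 - \epsilon_{\mathrm{vars}})} \right), \qquad k\epsilon_{\mathrm{minwt}} \cdot n \cdot \left( \frac{\sigma_{\max}^2 + 2\mu_{\max}^2}{\sigma_{\min}^2} \right), \qquad \text{and} \qquad 13k\epsilon_{\mathrm{wts}}^{1/3},$$

is at most $\epsilon/3$. Thus $\eta(\epsilon_{\mathrm{means}}, \epsilon_{\mathrm{vars}}, \epsilon_{\mathrm{wts}}, \epsilon_{\mathrm{minwt}}) \le \epsilon$, so at least one of the resulting distributions $Z^*$ satisfies
$\mathrm{KL}(Z \,\|\, Z^*) \le \epsilon$. ■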



6 Putting it all together 

6.1 Identifying a good distribution using maximum likelihood 

Theorem 5 gives us a list of distributions at least one of which is close to the target distribution we
are trying to learn. Now we must identify some distribution in the list which is close to the target. 
We use a natural maximum likelihood algorithm described in |S] to help us accomplish this: 

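Theorem 6 Let $\beta, \alpha, \epsilon > 0$ be such that $\alpha < \beta$. Let $\mathcal{Q}$ be a set of hypothesis distributions for
some distribution $P$ over the space $X$ such that at least one $Q^* \in \mathcal{Q}$ has $\mathrm{KL}(P \,\|\, Q^*) \le \epsilon$. Suppose
also that $\alpha \le Q(x) \le \beta$ for all $Q \in \mathcal{Q}$ and all $x$ such that $P(x) > 0$.

Run the ML algorithm on $\mathcal{Q}$ using a set $S$ of independent samples from $P$, where $|S| = m$.

Then, with probability $1 - \delta$, where $\delta \le (|\mathcal{Q}| + 1) \cdot \exp\left(-2m\epsilon^2 / \log^2(\beta/\alpha)\right)$, the algorithm outputs some
distribution $Q^{ML} \in \mathcal{Q}$ which has $\mathrm{KL}(P \,\|\, Q^{ML}) \le 4\epsilon$.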
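
The selection step of Theorem 6 can be illustrated with a minimal Python sketch. It assumes that "the ML algorithm" means choosing the hypothesis with the largest total log-likelihood on the sample; the function names, the representation of a hypothesis as a `(weights, means, variances)` triple, and the `floor` guard against `log(0)` are all illustrative assumptions, not taken from the paper.

```python
import math
import random

def gaussian_pdf(x, mu, var):
    """Density of a one-dimensional Gaussian N(mu, var) at x."""
    return math.exp(-(x - mu) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def mixture_log_pdf(x, weights, means, variances, floor=1e-300):
    """Log-density at point x of a mixture of axis-aligned Gaussians.

    means[i][j] and variances[i][j] describe coordinate j of component i.
    `floor` is an assumed numerical guard, standing in for the lower bound
    alpha that Theorem 6 places on every hypothesis density.
    """
    total = 0.0
    for w, mu, var in zip(weights, means, variances):
        comp = w
        for xj, mj, vj in zip(x, mu, var):
            comp *= gaussian_pdf(xj, mj, vj)
        total += comp
    return math.log(max(total, floor))

def ml_select(hypotheses, samples):
    """Index of the hypothesis with the largest total log-likelihood."""
    scores = [sum(mixture_log_pdf(x, *h) for x in samples) for h in hypotheses]
    return max(range(len(hypotheses)), key=scores.__getitem__)
```

Given a list of candidate mixtures and draws from the target, `ml_select` returns the position of the best-scoring candidate. The guarantee of Theorem 6 concerns the KL divergence of that candidate, which this sketch does not compute.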

6.2 The main result 

Here we put the pieces together and give our main learning result for mixtures of Gaussians. 

Theorem 7 Let $Z$ be any unknown mixture of $k$ $n$-dimensional Gaussians. There is a $(nL/\epsilon)^{O(k^3)} \cdot
\log(1/\delta)$ time algorithm which, given samples from $Z$ and any $\epsilon, \delta > 0$ as inputs, outputs a mixture
$Z'$ of $k$ Gaussians which with probability at least $1 - \delta$ satisfies $\mathrm{KL}(Z \,\|\, Z') \le \epsilon$.

Proof: Run the algorithm given by Theorem 5. With probability $1 - \delta$ this produces a list of
$T = (nL/\epsilon)^{O(k^3)} \cdot \log(1/\delta)$ hypothesis distributions, one of which, $Z^*$, has KL divergence at most $\epsilon$
from $Z$ and all of which have their pdfs bounded between $\exp(-\mathrm{poly}(nL/\epsilon))$ and $\mathrm{poly}(L)^n$ for all
$x \in [-M, M]^n$, where $M > \mu_{\max}$ is any $\mathrm{poly}(nL/\epsilon)$.

We now consider $Z_M$, the $M$-truncated version of $Z$; this is simply the distribution obtained
by restricting the support of $Z$ to be $[-M, M]^n$ and scaling so that $Z_M$ is a distribution (see
Appendix D for a precise definition of $Z_M$). We prove the following proposition in Appendix D.

Proposition 3 Let $P$ and $Q$ be any mixtures of $n$-dimensional Gaussians. Let $P_M$ denote the
$M$-truncated version of $P$. For some $M = \mathrm{poly}(nL/\epsilon)$ we have $|\mathrm{KL}(P_M \,\|\, Q) - \mathrm{KL}(P \,\|\, Q)| \le
4\epsilon + 2\epsilon \cdot \mathrm{KL}(P \,\|\, Q)$.

This proposition implies that $\mathrm{KL}(Z_M \,\|\, Z^*) \le 7\epsilon$.

Now run the ML algorithm with $m = \mathrm{poly}(nL/\epsilon) \log(M/\delta)$ on this list of hypothesis distributions
using $Z_M$ as the target distribution. (We can obtain draws from $Z_M$ using rejection sampling
from $Z$; with probability $1 - \delta$ this incurs only a negligible increase in the time required to obtain
$m$ draws.) Note that running the algorithm with $Z_M$ as the target distribution lets us assert
that all hypothesis distributions have pdfs bounded above and below on the support of the target
distribution, as is required by Theorem 6. (In contrast, since the support of $Z$ is all of $\mathbf{R}^n$, we
cannot guarantee that our hypothesis distributions have pdf bounds on the support of $Z$.) By
Theorem 6, with probability at least $1 - \delta$ the ML algorithm outputs a hypothesis $Z^{ML}$ such that
$\mathrm{KL}(Z_M \,\|\, Z^{ML}) \le 28\epsilon$.

It remains only to bound $\mathrm{KL}(Z \,\|\, Z^{ML})$. By Proposition 3 we have

$$\mathrm{KL}(Z \,\|\, Z^{ML}) \le 28\epsilon + 4\epsilon + 2\epsilon \cdot \mathrm{KL}(Z \,\|\, Z^{ML}),$$

which implies that $\mathrm{KL}(Z \,\|\, Z^{ML}) \le 33\epsilon$. The running time of the overall algorithm is $(nL/\epsilon)^{O(k^3)} \cdot
\log(1/\delta)$ and the theorem is proved. ■
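
The rejection-sampling step used in the proof (drawing from the $M$-truncated mixture by redrawing from $Z$ until the point lands in $[-M, M]^n$) can be sketched as follows. This is a minimal illustration under assumed names (`sample_mixture`, `sample_truncated`, `max_tries`), with each mixture given by per-coordinate means and standard deviations; it is not the paper's own code.

```python
import random

def sample_mixture(rng, weights, means, stds):
    """One draw from a mixture of axis-aligned Gaussians.

    means[i][j] and stds[i][j] describe coordinate j of component i.
    """
    i = rng.choices(range(len(weights)), weights=weights)[0]
    return [rng.gauss(m, s) for m, s in zip(means[i], stds[i])]

def sample_truncated(rng, weights, means, stds, M, max_tries=10000):
    """One draw from the M-truncated mixture via rejection sampling.

    Redraws from the untruncated mixture until every coordinate lies in
    [-M, M]. `max_tries` is an assumed safety cap, not part of the proof.
    """
    for _ in range(max_tries):
        x = sample_mixture(rng, weights, means, stds)
        if all(abs(xj) <= M for xj in x):
            return x
    raise RuntimeError("no accepted draw; M may be too small for this mixture")
```

When $M$ is large enough that almost all of the mass of $Z$ lies inside the box, nearly every draw is accepted, which is why the proof can treat the slowdown as negligible.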



7 Extensions to other distributions 

In this paper we have shown how to PAC learn mixtures of any constant number of distributions, 
each of which is an n-dimensional Gaussian product distribution. This expands upon the work by 
Feldman et al. which worked for discrete distributions in place of Gaussians. It should be clear 
from our work that in fact many "nice" univariate distributions can be handled similarly. Also, it 
should be noted that the n coordinates need not come from the same family of distributions; for 
example, our methods would handle mixtures where some attributes had discrete distributions and 
the remainder had Gaussian distributions. 

What level of "niceness" do our methods require for a parameterized family of univariate dis- 
tributions on R? First and foremost, it should be amenable to the "method of moments" from 
statistics. By this it is meant that it should be possible to solve for the parameters of the distribution
given a constant number of the moments. Distributions in this category include gamma
distributions, chi-square distributions, beta distributions, exponential (more generally, Weibull)
distributions, and more. As a trivial example, the unknown parameter of an exponential distribution
is simply its mean. As a slightly more involved example, given a beta distribution with
unknown parameters $\alpha$ and $\beta$ (the pdf for which is proportional to $x^{\alpha-1}(1 - x)^{\beta-1}$ on $[0, 1]$), these
parameters can be determined from estimates of the mean $\mu$ and variance $\sigma^2$ via

$$\alpha = \mu \left( \frac{\mu(1-\mu)}{\sigma^2} - 1 \right), \qquad \beta = (1-\mu) \left( \frac{\mu(1-\mu)}{\sigma^2} - 1 \right).$$

So long as the univariate distribution family can be determined by a constant number of moments,
our basic strategy of running WAM multiple times to determine moment estimates and then taking 
the cross-products of these lists can be employed. 
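
As a concrete instance of solving for parameters from moments, the beta example can be worked with the standard method-of-moments identities $\alpha = \mu c$ and $\beta = (1-\mu)c$, where $c = \mu(1-\mu)/\sigma^2 - 1$. The sketch below is illustrative only; the function name and the input validation are assumptions.

```python
def beta_from_moments(mean, var):
    """Recover (alpha, beta) of a beta distribution from its mean and variance.

    Uses the standard identities alpha = mean * c and beta = (1 - mean) * c
    with c = mean * (1 - mean) / var - 1. These require 0 < mean < 1 and
    0 < var < mean * (1 - mean).
    """
    if not 0.0 < mean < 1.0:
        raise ValueError("mean must lie strictly between 0 and 1")
    if not 0.0 < var < mean * (1.0 - mean):
        raise ValueError("variance must lie strictly between 0 and mean*(1-mean)")
    c = mean * (1.0 - mean) / var - 1.0
    return mean * c, (1.0 - mean) * c
```

For example, a beta distribution with $\alpha = 2$ and $\beta = 5$ has mean $2/7$ and variance $10/392$, and feeding those two values back in recovers the parameters. With estimated rather than exact moments, the recovered parameters inherit the estimation error.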

There are only two more concerns that need to be addressed for a given parameterized family of
distributions. First, one needs an analogue of Proposition 1 showing that products of independent
random variables from the distribution family are efficiently samplable. (In fact, this should hold for
mixtures of such, but this is very likely to be implied in any reasonable case.) This immediately holds
for any distribution with bounded support; it will also typically hold for "reasonable" probability
distributions that have pdfs with rapidly decaying tails.

Second, one needs an analogue of Theorem 4. This requires that it should be possible to convert
accurate candidate parameter values into a KL-close actual distribution. It seems that this will
typically be possible so long as the distributions in the family are not highly concentrated at any
particular point. The conversion procedure should also have the property that the distributions
it outputs have pdfs that are bounded below/above by at most exponentially small/large values,
at least on polynomially-sized domains. This again seems to be a mild constraint, satisfiable for
reasonable distributions with rapidly decaying tails.

In summary, we believe that for most parameterized distribution families "D" of interest, performing
a small amount of technical work should be sufficient to show that our methods can learn
"mixtures of products of D's". We leave the problem of checking these conditions for distribution
families of interest as an avenue for future research.

A Notational convention on Gaussians

Recall that all Gaussians we consider are $(\mu_{\max}, \sigma_{\min}^2, \sigma_{\max}^2)$-bounded. In dealing with Gaussians it
will be very useful to define a function $M(\theta)$ which satisfies

$$\int_{|x|>M} X(x)\,dx \le \theta, \qquad \int_{|x|>M} |x|\,X(x)\,dx \le \theta, \qquad \text{and} \qquad \int_{|x|>M} x^2 X(x)\,dx \le \theta$$

for any one-dimensional $(\mu_{\max}, \sigma_{\min}^2, \sigma_{\max}^2)$-bounded Gaussian $X$. Straightforward arguments show
that this can be achieved with $M(\theta) = \mathrm{poly}(L/\theta)$.

Notational convention: Throughout the appendices $M(\theta) = \mathrm{poly}(L/\theta)$ denotes a function satisfying
the conditions above.
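
The three tail conditions can be checked numerically for a single Gaussian using closed forms for truncated Gaussian moments. The sketch below is an illustration under assumptions: the function names are hypothetical, it assumes $M \ge 0$, and it tests one Gaussian at a time rather than the whole bounded family.

```python
import math

def _upper_tail_moments(mu, sigma, M):
    """Integrals of 1, x and x^2 against the N(mu, sigma^2) pdf over x > M."""
    a = (M - mu) / sigma
    q = 0.5 * math.erfc(a / math.sqrt(2.0))                   # Pr[z > a], z standard normal
    phi = math.exp(-0.5 * a * a) / math.sqrt(2.0 * math.pi)   # standard normal pdf at a
    m0 = q
    m1 = mu * q + sigma * phi
    m2 = (mu * mu + sigma * sigma) * q + sigma * (2.0 * mu + sigma * a) * phi
    return m0, m1, m2

def tail_integrals(mu, sigma, M):
    """Integrals of 1, |x| and x^2 against the N(mu, sigma^2) pdf over |x| > M.

    Assumes M >= 0. The lower tail x < -M is handled by reflecting the
    Gaussian, i.e. replacing mu with -mu.
    """
    upper = _upper_tail_moments(mu, sigma, M)
    lower = _upper_tail_moments(-mu, sigma, M)
    return tuple(u + l for u, l in zip(upper, lower))

def tails_ok(mu, sigma, M, theta):
    """True if all three tail integrals are at most theta for this Gaussian."""
    return max(tail_integrals(mu, sigma, M)) <= theta
```

For a standard Gaussian, `tails_ok(0.0, 1.0, 6.0, 1e-6)` holds while `tails_ok(0.0, 1.0, 1.0, 1e-6)` does not, which reflects how quickly the tails decay as $M$ grows.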

B Proof of Proposition 1

Proof: We shall prove the proposition for $W^2$; the proof for $W$ is similar but slightly simpler.

Let the mixing weights be $\pi^1, \dots, \pi^k$ and suppose that $Z_j$ is a mixture of $X_j^1, \dots, X_j^k$ for $j = 1, 2$.
Let $s = \mathbf{E}[W^2]$.

Recall the quantity $M = M(\theta)$ and take $C = M^4 = \mathrm{poly}(L/\theta)$. Let $(W_C)^2$ denote the random
variable $W^2$ conditioned on the event $|W^2| \le C$. Observe that

$$\Pr[W^2 > C] = \Pr[W^2 > M^4] \le \Pr[|Z_1| > M] + \Pr[|Z_2| > M] \le 2\theta, \qquad (1)$$

using the fact that $Z_1$ and $Z_2$ are mixtures of $(\mu_{\max}, \sigma_{\min}^2, \sigma_{\max}^2)$-bounded Gaussians and the definition of $M$.

We shall show that $|\mathbf{E}[(W_C)^2] - s| \le \epsilon/2$. Our sampling algorithm for $W^2$ will be to sample
from $(W_C)^2$ using rejection sampling and to compute and output the empirical mean of $(W_C)^2$. Since
the random variable $(W_C)^2$ is bounded in the range $[-C, C]$, by the Hoeffding bound if we take
$\mathrm{poly}(C/\epsilon) \cdot \log(1/\delta) = \mathrm{poly}(L/(\epsilon\theta)) \cdot \log(1/\delta)$ samples from $(W_C)^2$ then with probability $1 - \delta$ the
empirical mean of $(W_C)^2$ will be within $\epsilon/2$ of the true mean $\mathbf{E}[(W_C)^2]$. (Technically, we must also note
that since $\theta$ is much smaller than 1 we can do rejection sampling with very little slowdown.) Thus
it remains to show that indeed $|\mathbf{E}[(W_C)^2] - s| \le \epsilon/2$.

Observe that $\mathbf{E}[(W_C)^2] = \sum_{i=1}^k \pi^i \mathbf{E}[(W_C)^2 \mid i \text{ is chosen}]$ and $s = \sum_{i=1}^k \pi^i \mathbf{E}[W^2 \mid i \text{ is chosen}]$.
Thus by convexity it is sufficient to prove

$$\left| \mathbf{E}[(X_1^i)^2 (X_2^i)^2 \mid (X_1^i)^2 (X_2^i)^2 \le C] - \mathbf{E}[(X_1^i)^2 (X_2^i)^2] \right| \le \epsilon/2$$

for all $i = 1, \dots, k$. For simplicity we now write $X_j = X_j^i$ for $j = 1, 2$. Recall that $X_1$ and $X_2$ are
one-dimensional $(\mu_{\max}, \sigma_{\min}^2, \sigma_{\max}^2)$-bounded Gaussians.

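Let $p(w)$ be the pdf for the random variable $(X_1)^2 (X_2)^2$. Note that

$$\begin{aligned}
\left| \int_{|w|>C} w\,p(w)\,dw \right|
&= \iint_{x_1^2 x_2^2 > C} x_1^2 x_2^2\, X_1(x_1) X_2(x_2)\,dx_1\,dx_2 \\
&\le \int_{x_1} \int_{x_2} \left( \mathbf{1}_{\{|x_1| > C^{1/4}\}} + \mathbf{1}_{\{|x_2| > C^{1/4}\}} \right) x_1^2 x_2^2\, X_1(x_1) X_2(x_2)\,dx_1\,dx_2 \\
&= \left( \int_{x_2} x_2^2 X_2(x_2)\,dx_2 \right) \left( \int_{|x_1|>M} x_1^2 X_1(x_1)\,dx_1 \right) + \left( \int_{x_1} x_1^2 X_1(x_1)\,dx_1 \right) \left( \int_{|x_2|>M} x_2^2 X_2(x_2)\,dx_2 \right) \\
&= \mathbf{E}[(X_2)^2] \int_{|x_1|>M} x_1^2 X_1(x_1)\,dx_1 + \mathbf{E}[(X_1)^2] \int_{|x_2|>M} x_2^2 X_2(x_2)\,dx_2 \\
&\le 2L^2 \left( \int_{|x_1|>M} x_1^2 X_1(x_1)\,dx_1 + \int_{|x_2|>M} x_2^2 X_2(x_2)\,dx_2 \right) \\
&\le 4\theta L^2, \qquad (2)
\end{aligned}$$

using the definitions of $M$ and $L$.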

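Let $\eta = 1/(1 - \Pr[(X_1)^2 (X_2)^2 > C]) - 1$, so $\eta \le 3\theta$ using the same argument as in (1). Note
that the pdf $p_C(w)$ for the random variable $(X_1)^2 (X_2)^2$ conditioned on $|(X_1)^2 (X_2)^2| \le C$ is given
by

$$p_C(w) = \begin{cases} (1 + \eta)\,p(w) & \text{if } |w| \le C, \\ 0 & \text{if } |w| > C. \end{cases}$$

Let $t = \mathbf{E}[(X_1)^2 (X_2)^2]$; finally, we can show that $|\mathbf{E}[(X_1)^2 (X_2)^2 \mid (X_1)^2 (X_2)^2 \le C] - t| \le \epsilon/2$, as
desired:

$$\begin{aligned}
\left| \mathbf{E}[(X_1)^2 (X_2)^2 \mid (X_1)^2 (X_2)^2 \le C] - t \right|
&= \left| \int w\,p_C(w) - w\,p(w) \right| \\
&= \left| (1 + \eta) \int_{|w| \le C} w\,p(w) - \int_{|w| \le C} w\,p(w) - \int_{|w| > C} w\,p(w) \right| \\
&= \left| \eta \int_{|w| \le C} w\,p(w) - \int_{|w| > C} w\,p(w) \right| \\
&\le \eta \int_{|w| \le C} w\,p(w) + \left| \int_{|w| > C} w\,p(w) \right| \\
&\le \eta t + \theta \le (3\theta)\,\mathrm{poly}(L) + \theta,
\end{aligned}$$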

once more using the definition of $M$ (note: $C \ge M$). Choosing $\theta = \mathrm{poly}(\epsilon/L)$, we get that this is
bounded by $\epsilon/2$; consequently $M = \mathrm{poly}(L/\epsilon)$ and the sampling time is as claimed. ■

C Auxiliary facts about KL divergence

The following fact gives the KL divergence between two univariate Gaussians; it can be found in, 

e.g., m- 

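Fact 8 Let $P, Q$ each be a one-dimensional normal distribution with means and variances $\mu_P, \sigma_P^2$
and $\mu_Q, \sigma_Q^2$ respectively. Then we have

$$\mathrm{KL}(P \,\|\, Q) = \frac{1}{2} \ln \frac{\sigma_Q^2}{\sigma_P^2} + \frac{\sigma_P^2 - \sigma_Q^2 + (\mu_P - \mu_Q)^2}{2\sigma_Q^2}.$$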
An easy consequence is the following bound on the KL divergence between two Gaussians: 

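Corollary 4 Let $P, Q$ be one-dimensional Gaussians as above and suppose that $|\mu_P - \mu_Q| \le \epsilon_{\mathrm{means}}$,
$|\sigma_P^2 - \sigma_Q^2| \le \epsilon_{\mathrm{vars}}$, and $\sigma_P^2 \ge \sigma_{\min}^2$. Then

$$\mathrm{KL}(P \,\|\, Q) \le \frac{\epsilon_{\mathrm{vars}}}{2\sigma_{\min}^2} + \frac{\epsilon_{\mathrm{means}}^2 + \epsilon_{\mathrm{vars}}}{2(\sigma_{\min}^2 - \epsilon_{\mathrm{vars}})}.$$

Proof: We have

$$\frac{\sigma_Q^2}{\sigma_P^2} \le \frac{\sigma_{\min}^2 + \epsilon_{\mathrm{vars}}}{\sigma_{\min}^2} = 1 + \frac{\epsilon_{\mathrm{vars}}}{\sigma_{\min}^2},$$

which implies

$$\frac{1}{2} \ln \frac{\sigma_Q^2}{\sigma_P^2} \le \frac{1}{2} \ln \left( 1 + \frac{\epsilon_{\mathrm{vars}}}{\sigma_{\min}^2} \right) \le \frac{\epsilon_{\mathrm{vars}}}{2\sigma_{\min}^2}.$$

The bound easily follows observing that $\sigma_Q^2 \ge \sigma_{\min}^2 - \epsilon_{\mathrm{vars}}$. ■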
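
A small numeric check can make the relationship between Fact 8 and Corollary 4 concrete. The sketch below uses the standard closed form for the KL divergence between two univariate Gaussians, and takes the corollary's bound to be $\epsilon_{\mathrm{vars}}/(2\sigma_{\min}^2) + (\epsilon_{\mathrm{means}}^2 + \epsilon_{\mathrm{vars}})/(2(\sigma_{\min}^2 - \epsilon_{\mathrm{vars}}))$, which is our reading of the statement; the function names are illustrative.

```python
import math

def kl_gauss(mu_p, var_p, mu_q, var_q):
    """KL(P || Q) for one-dimensional Gaussians P = N(mu_p, var_p), Q = N(mu_q, var_q)."""
    return (0.5 * math.log(var_q / var_p)
            + (var_p - var_q + (mu_p - mu_q) ** 2) / (2.0 * var_q))

def corollary4_bound(eps_means, eps_vars, var_min):
    """Upper bound on KL(P || Q), in the form we read from Corollary 4.

    Assumes |mu_p - mu_q| <= eps_means, |var_p - var_q| <= eps_vars,
    var_p >= var_min, and eps_vars < var_min.
    """
    if eps_vars >= var_min:
        raise ValueError("the bound requires eps_vars < var_min")
    return (eps_vars / (2.0 * var_min)
            + (eps_means ** 2 + eps_vars) / (2.0 * (var_min - eps_vars)))
```

For instance, with $\sigma_{\min}^2 = 0.5$, $\epsilon_{\mathrm{means}} = 0.2$ and $\epsilon_{\mathrm{vars}} = 0.1$, every pair of Gaussians satisfying the hypotheses has `kl_gauss` no larger than `corollary4_bound(0.2, 0.1, 0.5)`, which evaluates to about 0.275.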

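Proposition 5 Suppose $P_1, \dots, P_n$ and $Q_1, \dots, Q_n$ are distributions satisfying $\mathrm{KL}(P_i \,\|\, Q_i) \le \epsilon_i$
for all $i$. Then $\mathrm{KL}(P_1 \times \cdots \times P_n \,\|\, Q_1 \times \cdots \times Q_n) \le \sum_{i=1}^n \epsilon_i$.

Proof: We prove the case $n = 2$:

$$\begin{aligned}
\mathrm{KL}(P_1 \times P_2 \,\|\, Q_1 \times Q_2)
&= \iint P_1(x) P_2(y) \ln \frac{P_1(x) P_2(y)}{Q_1(x) Q_2(y)}\,dx\,dy \\
&= \iint P_1(x) P_2(y) \ln \frac{P_1(x)}{Q_1(x)}\,dx\,dy + \iint P_1(x) P_2(y) \ln \frac{P_2(y)}{Q_2(y)}\,dx\,dy \\
&= \int P_2(y)\,\mathrm{KL}(P_1 \,\|\, Q_1)\,dy + \int P_1(x)\,\mathrm{KL}(P_2 \,\|\, Q_2)\,dx \\
&\le \epsilon_1 + \epsilon_2.
\end{aligned}$$

The general case follows by induction. ■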
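
The additivity behind Proposition 5 is easy to verify exactly on small discrete distributions, where the integrals become finite sums. The sketch below is only an illustration on finite supports (the proposition itself is stated for general distributions), and the helper names are assumptions.

```python
import math
from itertools import product

def kl(p, q):
    """KL(p || q) for two distributions given as aligned lists of probabilities."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def product_dist(*marginals):
    """Probabilities of the product distribution, in itertools.product order."""
    return [math.prod(ps) for ps in product(*marginals)]
```

Because `product_dist` enumerates outcomes in the same order for the $P$'s and the $Q$'s, `kl(product_dist(p1, p2), product_dist(q1, q2))` equals `kl(p1, q1) + kl(p2, q2)` up to floating-point error, which is the equality established in the proof before the final inequality.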
D Truncated versus untruncated mixtures of Gaussians



Definition 5 Let $X$ be a distribution over $\mathbf{R}^n$. The $M$-truncated version of $X$ is the distribution
$X_M$ obtained by restricting the support of $X$ to be $[-M, M]^n$ and scaling so that $X_M$ is a
distribution. More precisely, for $x \in \mathbf{R}^n$ we have

$$X_M(x) = \begin{cases} 0 & \text{if } \|x\|_\infty > M, \\ c \cdot X(x) & \text{if } \|x\|_\infty \le M, \end{cases}$$

where $c = 1 / \int_{x \in [-M,M]^n} X(x)$ is chosen so that $\int X_M(x) = 1$.

In this section we prove Proposition 3.

Proposition 3 Let $P$ and $Q$ be any mixtures of $n$-dimensional Gaussians. Let $P_M$ denote the
$M$-truncated version of $P$. For some $M = \mathrm{poly}(nL/\epsilon)$ we have $|\mathrm{KL}(P_M \,\|\, Q) - \mathrm{KL}(P \,\|\, Q)| \le
4\epsilon + 2\epsilon \cdot \mathrm{KL}(P \,\|\, Q)$.

Proof: We will take $M = M(\theta)$ (recall Appendix A). As we go through the proof various conditions
will be set on $\theta$. At the end of the proof we will see that we can take $\theta = \mathrm{poly}(\epsilon/(nL))$ and obtain
the desired bound on $|\mathrm{KL}(P_M \,\|\, Q) - \mathrm{KL}(P \,\|\, Q)|$ and satisfy all the conditions on $\theta$. This proves
the proposition.

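We have that $P_M(x)$ satisfies

$$P_M(x) = \begin{cases} (1 + \delta)\,P(x) & \text{if } x \in [-M, M]^n, \\ 0 & \text{if } x \notin [-M, M]^n, \end{cases}$$

where $\delta > 0$ is chosen so that $\frac{1}{1+\delta} = \int_{x \in [-M,M]^n} P(x)$. Using the definition of $M$ we have

$$\int_{x \notin [-M,M]^n} P(x) = \Pr[x \notin [-M, M]^n] \le \sum_{j=1}^n \Pr[|x_j| > M] \le n\theta \le \epsilon,$$

where we have used the fact that $\theta \le \epsilon/n$ (this is our first condition on $\theta$). Consequently we have
$\frac{1}{1+\delta} \ge 1 - \epsilon$, so $\delta \le 2\epsilon$.

We have

$$\begin{aligned}
|\mathrm{KL}(P_M \,\|\, Q) - \mathrm{KL}(P \,\|\, Q)|
&= \left| \int_{x \in [-M,M]^n} (1 + \delta) P(x) \ln \frac{(1 + \delta) P(x)}{Q(x)} - \int_{x \in \mathbf{R}^n} P(x) \ln \frac{P(x)}{Q(x)} \right| \\
&= \left| (1 + \delta) \ln(1 + \delta) \int_{x \in [-M,M]^n} P(x) + \delta \int_{x \in [-M,M]^n} P(x) \ln \frac{P(x)}{Q(x)} - \int_{x \notin [-M,M]^n} P(x) \ln \frac{P(x)}{Q(x)} \right| \\
&\le (1 + \delta) \ln(1 + \delta) + \delta \left| \int_{x \in [-M,M]^n} P(x) \ln \frac{P(x)}{Q(x)} \right| + \left| \int_{x \notin [-M,M]^n} P(x) \ln \frac{P(x)}{Q(x)} \right| \\
&\le \delta(1 + \delta) + \delta |R| + |S|,
\end{aligned}$$

where $R = \int_{x \in [-M,M]^n} P(x) \ln \frac{P(x)}{Q(x)}$ and $S = \int_{x \notin [-M,M]^n} P(x) \ln \frac{P(x)}{Q(x)}$. For succinctness let $\kappa$ denote
$\mathrm{KL}(P \,\|\, Q)$. Note that we have $\kappa = R + S$.

Suppose we show that $|S| \le \epsilon$. Then since $\kappa = R + S$, we must have $|R| \le \kappa + \epsilon$, and hence
$|\mathrm{KL}(P_M \,\|\, Q) - \kappa| \le \delta(1 + \delta) + \delta(\kappa + \epsilon) + \epsilon \le 4\epsilon + 2\epsilon\kappa$ (using $\delta \le 2\epsilon$), as desired. Thus we can
complete the proof by showing $|S| \le \epsilon$.

Let us analyze the integrand of $S$. Decompose $P$ into its mixture components, i.e. $P(x) =
\sum_{i=1}^k \pi^i P^i(x)$, where $P^1, \dots, P^k$ are $n$-dimensional Gaussians. Hence

$$S = \sum_{i=1}^k \pi^i \int_{x \notin [-M,M]^n} P^i(x) \ln \frac{P(x)}{Q(x)}.$$

We will show that for each $i$ we have $\left| \int_{x \notin [-M,M]^n} P^i(x) \ln \frac{P(x)}{Q(x)} \right| \le \epsilon$. It then follows that $|S| \le \epsilon$
since $|S|$ is upper bounded by a convex combination of these quantities.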

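Let us now analyze the quantity $\ln \frac{P(x)}{Q(x)}$. We will show that for any $x \notin [-M, M]^n$, neither $P(x)$
nor $Q(x)$ can be either "too small" or "too large" as a function of $\|x\|_2^2$, hence $\left| \ln \frac{P(x)}{Q(x)} \right|$ will be
of moderate size. We will prove this for $P(x)$ using the fact that it is a mixture of $n$-dimensional
$(\mu_{\max}, \sigma_{\min}^2, \sigma_{\max}^2)$-bounded Gaussians; since this is also true of $Q(x)$, the same bound will hold for
it.

We will show that for all $i = 1, \dots, k$ and all $x \in \mathbf{R}^n$ we have $P^i(x) \in [t(x), T]$ where $T$ is
a quantity and $t(x)$ is a function that will both be defined below. Since $P(x) = \sum_{i=1}^k \pi^i P^i(x)$ is
a convex combination of the $P^i(x)$'s, the same bound will hold for $P(x)$. Fix any $i$ and consider
the Gaussian $P^i$. Since this Gaussian is axis-aligned, we have $P^i(x) = \prod_{j=1}^n \phi_{\mu_j, \sigma_j^2}(x_j)$ for some
pairs $(\mu_1, \sigma_1^2), \dots, (\mu_n, \sigma_n^2)$ satisfying $|\mu_j| \le \mu_{\max}$, $\sigma_j^2 \in [\sigma_{\min}^2, \sigma_{\max}^2]$. (Here $\phi_{\mu, \sigma^2}(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{(x - \mu)^2}{2\sigma^2} \right)$ is the usual pdf
for a one-dimensional Gaussian.) It is easy to see that for any $x_j$,

$$\frac{1}{\sqrt{2\pi}\,\sigma_{\max}} \cdot \exp\left( -\frac{x_j^2 + \mu_{\max}^2}{\sigma_{\min}^2} \right) \le \phi_{\mu_j, \sigma_j^2}(x_j) \le \frac{1}{\sqrt{2\pi}\,\sigma_{\min}}.$$

Hence for all $x \in \mathbf{R}^n$ we have

$$t(x) := \left( \frac{1}{\sqrt{2\pi}\,\sigma_{\max}} \right)^n \exp\left( -\frac{n\mu_{\max}^2}{\sigma_{\min}^2} \right) \exp\left( -\frac{\|x\|_2^2}{\sigma_{\min}^2} \right) \le P^i(x) \le \left( \frac{1}{\sqrt{2\pi}\,\sigma_{\min}} \right)^n =: T \qquad (3)$$

for all $i$, and so (3) holds true for $P(x)$ as well. As stated earlier, the same argument also shows
that (3) holds for $Q(x)$.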
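We conclude that for any $x \notin [-M, M]^n$,

$$\begin{aligned}
\left| \ln \frac{P(x)}{Q(x)} \right| &\le |\ln t(x)| + |\ln T| \\
&\le \frac{n\mu_{\max}^2 + \|x\|_2^2}{\sigma_{\min}^2} + n \left| \ln\left( \sqrt{2\pi}\,\sigma_{\max} \right) \right| + n \left| \ln\left( 1 / (\sqrt{2\pi}\,\sigma_{\min}) \right) \right| \\
&\le O\left( n \cdot \frac{\mu_{\max}^2}{\sigma_{\min}^2} \cdot \ln \frac{\sigma_{\max}}{\sigma_{\min}} \right) \cdot \|x\|_2^2.
\end{aligned}$$

Recall that we want to show $\left| \int_{x \notin [-M,M]^n} P^i(x) \ln \frac{P(x)}{Q(x)} \right| \le \epsilon$. It clearly suffices to show that
$\int_{x \notin [-M,M]^n} P^i(x) \left| \ln \frac{P(x)}{Q(x)} \right| \le \epsilon$. By the above it suffices to show

$$O\left( n \cdot \frac{\mu_{\max}^2}{\sigma_{\min}^2} \cdot \ln \frac{\sigma_{\max}}{\sigma_{\min}} \right) \int_{x \notin [-M,M]^n} P^i(x)\,\|x\|_2^2 \le \epsilon.$$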

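We have

$$\int_{x \notin [-M,M]^n} P^i(x)\,\|x\|_2^2 = \sum_{j=1}^n \int_{x \notin [-M,M]^n} P^i(x)\,x_j^2. \qquad (4)$$

Fix $j$; we now bound $\int_{x \notin [-M,M]^n} P^i(x)\,x_j^2$. Recall that $P^i(x) = P_1^i(x_1) \cdots P_n^i(x_n)$. We have

$$\int_{x \notin [-M,M]^n} P^i(x)\,x_j^2 \le \sum_{\ell=1}^n \int_{x \in \mathbf{R}^n : |x_\ell| > M} P^i(x)\,x_j^2 = \int_{x \in \mathbf{R}^n : |x_j| > M} P^i(x)\,x_j^2 + \sum_{\ell \ne j} \int_{x \in \mathbf{R}^n : |x_\ell| > M} P^i(x)\,x_j^2.$$

For the first integral above we have

$$\int_{x \in \mathbf{R}^n : |x_j| > M} P^i(x)\,x_j^2 = \left( \prod_{\ell \ne j} \int_{x_\ell \in \mathbf{R}} P_\ell^i(x_\ell)\,dx_\ell \right) \cdot \int_{|x_j| > M} P_j^i(x_j)\,x_j^2\,dx_j = \int_{|x_j| > M} P_j^i(x_j)\,x_j^2\,dx_j \le \theta, \qquad (5)$$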
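where the inequality is by the definition of $M$. For the second term above we have

$$\sum_{\ell \ne j} \int_{x \in \mathbf{R}^n : |x_\ell| > M} P^i(x)\,x_j^2 = \sum_{\ell \ne j} \left( \int_{|x_\ell| > M} P_\ell^i(x_\ell)\,dx_\ell \right) \left( \int_{x_j \in \mathbf{R}} P_j^i(x_j)\,x_j^2\,dx_j \right), \qquad (6)$$

where we have used the fact that for any $\ell'$ which is neither $\ell$ nor $j$ we have

$$\int_{x_{\ell'} \in \mathbf{R}} P_{\ell'}^i(x_{\ell'})\,dx_{\ell'} = 1.$$

Again using the definition of $M$ to bound the integral over variable $x_\ell$ in (6) above by $\theta$, we have
that (6) is at most

$$(n - 1)\theta \int_{x_j \in \mathbf{R}} P_j^i(x_j)\,x_j^2\,dx_j = (n - 1)\theta\,\mathbf{E}_{P_j^i}[x^2] = (n - 1)\theta \left( \mathrm{Var}_{P_j^i}[x] + \mathbf{E}_{P_j^i}[x]^2 \right) \le (n - 1)\theta \left( \sigma_{\max}^2 + \mu_{\max}^2 \right), \qquad (7)$$

where the inequality holds since $P_j^i$ is a one-dimensional $(\mu_{\max}, \sigma_{\min}^2, \sigma_{\max}^2)$-bounded Gaussian.
Putting all the pieces together, we find that (4) is at most

$$n \left[ \theta + (n - 1)\theta \left( \sigma_{\max}^2 + \mu_{\max}^2 \right) \right] \le n^2 \theta \left( \sigma_{\max}^2 + \mu_{\max}^2 \right).$$

It follows that $|S| \le n^2 \theta (\sigma_{\max}^2 + \mu_{\max}^2) \cdot O\left( n \cdot \frac{\mu_{\max}^2}{\sigma_{\min}^2} \cdot \ln \frac{\sigma_{\max}}{\sigma_{\min}} \right)$. We can take $\theta = \mathrm{poly}(\epsilon/(nL))$ and have
this quantity be at most $\epsilon$. ■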